2025-05-28-00-00
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
Abstract
arXiv:2505.19427v1 Announce Type: cross Abstract: The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise -norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
摘要
随着大型语言模型(LLMs)计算需求的日益增长,高效的推理与激活策略变得愈发关键。尽管近期提出的混合专家(MoE)等方法通过选择性激活实现了性能优化,但这些方法需要专门的训练过程。相比之下,无需训练的稀疏激活方法凭借即插即用特性,具有更广泛的适用性和更优越的资源效率。然而,现有方法大多仅依据隐藏状态幅值来决定激活,导致较高的近似误差和次优的推理准确率。为解决这些局限,我们提出WINA(权重感知神经元激活)——一种新颖、简单且无需训练的稀疏激活框架,该方法联合考虑隐藏状态幅值和权重矩阵列向量的ℓ2范数。我们证明该稀疏化策略能获得最优的近似误差界,其理论保证比现有技术更为严格。实证研究表明,在相同稀疏度下,WINA在多种LLM架构和数据集上的平均性能比最先进方法(如TEAL)最高可提升2.94%。这些成果使WINA成为LLM推理中无需训练稀疏激活的新性能标杆,推动了无需训练稀疏激活方法的发展,并为高效推理建立了稳健的基准。